{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Linear Regression Tutorial\n",
    "by Marc Deisenroth"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The purpose of this notebook is to practice implementing some linear algebra (equations provided) and to explore some properties of linear regression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import scipy.linalg\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We consider a linear regression problem of the form\n",
    "$$\n",
    "y = \\boldsymbol x^T\\boldsymbol\\theta + \\epsilon\\,,\\quad \\epsilon \\sim \\mathcal N(0, \\sigma^2)\n",
    "$$\n",
    "where $\\boldsymbol x\\in\\mathbb{R}^D$ are inputs and $y\\in\\mathbb{R}$ are noisy observations. The parameter vector $\\boldsymbol\\theta\\in\\mathbb{R}^D$ parametrizes the function.\n",
    "\n",
    "We assume we have a training set $(\\boldsymbol x_n, y_n)$, $n=1,\\ldots, N$. We summarize the sets of training inputs in $\\mathcal X = \\{\\boldsymbol x_1, \\ldots, \\boldsymbol x_N\\}$ and corresponding training targets $\\mathcal Y = \\{y_1, \\ldots, y_N\\}$, respectively.\n",
    "\n",
    "In this tutorial, we are interested in finding good parameters $\\boldsymbol\\theta$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Define training set\n",
    "X = np.array([-3, -1, 0, 1, 3]).reshape(-1,1) # 5x1 vector, N=5, D=1\n",
    "y = np.array([-1.2, -0.7, 0.14, 0.67, 1.67]).reshape(-1,1) # 5x1 vector\n",
    "\n",
    "# Plot the training set\n",
    "plt.figure()\n",
    "plt.plot(X, y, '+', markersize=10)\n",
    "plt.xlabel(\"$x$\")\n",
    "plt.ylabel(\"$y$\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Maximum Likelihood\n",
    "We will start with maximum likelihood estimation of the parameters $\\boldsymbol\\theta$. In maximum likelihood estimation, we find the parameters $\\boldsymbol\\theta^{\\mathrm{ML}}$ that maximize the likelihood\n",
    "$$\n",
    "p(\\mathcal Y | \\mathcal X, \\boldsymbol\\theta) = \\prod_{n=1}^N p(y_n | \\boldsymbol x_n, \\boldsymbol\\theta)\\,.\n",
    "$$\n",
    "From the lecture we know that the maximum likelihood estimator is given by\n",
    "$$\n",
    "\\boldsymbol\\theta^{\\text{ML}} = (\\boldsymbol X^T\\boldsymbol X)^{-1}\\boldsymbol X^T\\boldsymbol y\\in\\mathbb{R}^D\\,,\n",
    "$$\n",
    "where \n",
    "$$\n",
    "\\boldsymbol X = [\\boldsymbol x_1, \\ldots, \\boldsymbol x_N]^T\\in\\mathbb{R}^{N\\times D}\\,,\\quad \\boldsymbol y = [y_1, \\ldots, y_N]^T \\in\\mathbb{R}^N\\,.\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us compute the maximum likelihood estimate for a given training set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## EDIT THIS FUNCTION\n",
    "def max_lik_estimate(X, y):\n",
    "    \n",
    "    # X: N x D matrix of training inputs\n",
    "    # y: N x 1 vector of training targets/observations\n",
    "    # returns: maximum likelihood parameters (D x 1)\n",
    "    \n",
    "    N, D = X.shape\n",
    "    theta_ml = np.zeros((D,1)) ## <-- EDIT THIS LINE\n",
    "    return theta_ml"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# get maximum likelihood estimate\n",
    "theta_ml = max_lik_estimate(X,y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, make a prediction using the maximum likelihood estimate that we just found"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## EDIT THIS FUNCTION\n",
    "def predict_with_estimate(Xtest, theta):\n",
    "    \n",
    "    # Xtest: K x D matrix of test inputs\n",
    "    # theta: D x 1 vector of parameters\n",
    "    # returns: prediction of f(Xtest); K x 1 vector\n",
    "    \n",
    "    prediction = Xtest ## <-- EDIT THIS LINE\n",
    "    \n",
    "    return prediction "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's see whether we got something useful:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# define a test set\n",
    "Xtest = np.linspace(-5,5,100).reshape(-1,1) # 100 x 1 vector of test inputs\n",
    "\n",
    "# predict the function values at the test points using the maximum likelihood estimator\n",
    "ml_prediction = predict_with_estimate(Xtest, theta_ml)\n",
    "\n",
    "# plot\n",
    "plt.figure()\n",
    "plt.plot(X, y, '+', markersize=10)\n",
    "plt.plot(Xtest, ml_prediction)\n",
    "plt.xlabel(\"$x$\")\n",
    "plt.ylabel(\"$y$\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Questions\n",
    "1. Does the solution above look reasonable?\n",
    "2. Play around with different values of $\\theta$. How do the corresponding functions change?\n",
    "3. Modify the training targets $\\mathcal Y$ and re-run your computation. What changes?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us now look at a different training set, where we add 2.0 to every $y$-value, and compute the maximum likelihood estimate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "ynew = y + 2.0\n",
    "\n",
    "plt.figure()\n",
    "plt.plot(X, ynew, '+', markersize=10)\n",
    "plt.xlabel(\"$x$\")\n",
    "plt.ylabel(\"$y$\");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# get maximum likelihood estimate\n",
    "theta_ml = max_lik_estimate(X, ynew)\n",
    "print(theta_ml)\n",
    "\n",
    "# define a test set\n",
    "Xtest = np.linspace(-5,5,100).reshape(-1,1) # 100 x 1 vector of test inputs\n",
    "\n",
    "# predict the function values at the test points using the maximum likelihood estimator\n",
    "ml_prediction = predict_with_estimate(Xtest, theta_ml)\n",
    "\n",
    "# plot\n",
    "plt.figure()\n",
    "plt.plot(X, ynew, '+', markersize=10)\n",
    "plt.plot(Xtest, ml_prediction)\n",
    "plt.xlabel(\"$x$\")\n",
    "plt.ylabel(\"$y$\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question:\n",
    "1. This maximum likelihood estimate doesn't look too good: The orange line is too far away from the observations although we just shifted them by 2. Why is this the case?\n",
    "2. How can we fix this problem?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us now define a linear regression model that is slightly more flexible:\n",
    "$$\n",
    "y = \\theta_0 + \\boldsymbol x^T \\boldsymbol\\theta_1 + \\epsilon\\,,\\quad \\epsilon\\sim\\mathcal N(0,\\sigma^2)\n",
    "$$\n",
    "Here, we added an offset (bias) parameter $\\theta_0$ to our original model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question:\n",
    "1. What is the effect of this bias parameter, i.e., what additional flexibility does it offer?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we now define the inputs to be the augmented vector $\\boldsymbol x_{\\text{aug}} = \\begin{bmatrix}1\\\\\\boldsymbol x\\end{bmatrix}$, we can write the new linear regression model as \n",
    "$$\n",
    "y = \\boldsymbol x_{\\text{aug}}^T\\boldsymbol\\theta_{\\text{aug}} + \\epsilon\\,,\\quad \\boldsymbol\\theta_{\\text{aug}} = \\begin{bmatrix}\n",
    "\\theta_0\\\\\n",
    "\\boldsymbol\\theta_1\n",
    "\\end{bmatrix}\\,.\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "N, D = X.shape\n",
    "X_aug = np.hstack([np.ones((N,1)), X]) # augmented training inputs of size N x (D+1)\n",
    "theta_aug = np.zeros((D+1, 1)) # new theta vector of size (D+1) x 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us now compute the maximum likelihood estimator for this setting.\n",
    "_Hint:_ If possible, re-use code that you have already written"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## EDIT THIS FUNCTION\n",
    "def max_lik_estimate_aug(X_aug, y):\n",
    "    \n",
    "    theta_aug_ml = np.zeros((D+1,1)) ## <-- EDIT THIS LINE\n",
    "    \n",
    "    return theta_aug_ml"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "theta_aug_ml = max_lik_estimate_aug(X_aug, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we can make predictions again:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# define a test set (we also need to augment the test inputs with ones)\n",
    "Xtest_aug = np.hstack([np.ones((Xtest.shape[0],1)), Xtest]) # 100 x (D + 1) vector of test inputs\n",
    "\n",
    "# predict the function values at the test points using the maximum likelihood estimator\n",
    "ml_prediction = predict_with_estimate(Xtest_aug, theta_aug_ml)\n",
    "\n",
    "# plot\n",
    "plt.figure()\n",
    "plt.plot(X, y, '+', markersize=10)\n",
    "plt.plot(Xtest, ml_prediction)\n",
    "plt.xlabel(\"$x$\")\n",
    "plt.ylabel(\"$y$\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It seems this has solved our problem! \n",
    "#### Question:\n",
    "1. Play around with the first parameter of $\\boldsymbol\\theta_{\\text{aug}}$ and see how the fit of the function changes.\n",
    "2. Play around with the second parameter of $\\boldsymbol\\theta_{\\text{aug}}$ and see how the fit of the function changes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Nonlinear Features\n",
    "So far, we have looked at linear regression with linear features. This allowed us to fit straight lines. However, linear regression also allows us to fit functions that are nonlinear in the inputs $\\boldsymbol x$, as long as the parameters $\\boldsymbol\\theta$ appear linearly. This means, we can learn functions of the form\n",
    "$$\n",
    "f(\\boldsymbol x, \\boldsymbol\\theta) = \\sum_{k = 1}^K \\theta_k \\phi_k(\\boldsymbol x)\\,,\n",
    "$$\n",
    "where the features $\\phi_k(\\boldsymbol x)$ are (possibly nonlinear) transformations of the inputs $\\boldsymbol x$.\n",
    "\n",
    "Let us have a look at an example where the observations clearly do not lie on a straight line:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "y = np.array([10.05, 1.5, -1.234, 0.02, 8.03]).reshape(-1,1)\n",
    "plt.figure()\n",
    "plt.plot(X, y, '+')\n",
    "plt.xlabel(\"$x$\")\n",
    "plt.ylabel(\"$y$\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Polynomial Regression\n",
    "One class of functions that is covered by linear regression is the family of polynomials because we can write a polynomial of degree $K$ as\n",
    "$$\n",
    "\\sum_{k=0}^K \\theta_k x^k = \\boldsymbol \\phi(x)^T\\boldsymbol\\theta\\,,\\quad\n",
    "\\boldsymbol\\phi(x)= \n",
    "\\begin{bmatrix}\n",
    "x^0\\\\\n",
    "x^1\\\\\n",
    "\\vdots\\\\\n",
    "x^K\n",
    "\\end{bmatrix}\\in\\mathbb{R}^{K+1}\\,.\n",
    "$$\n",
    "Here, $\\boldsymbol\\phi(x)$ is a nonlinear feature transformation of the inputs $x\\in\\mathbb{R}$.\n",
    "\n",
    "Similar to the earlier case we can define a matrix that collects all the feature transformations of the training inputs:\n",
    "$$\n",
    "\\boldsymbol\\Phi = \\begin{bmatrix}\n",
    "\\boldsymbol\\phi(x_1) & \\boldsymbol\\phi(x_2) & \\cdots & \\boldsymbol\\phi(x_n)\n",
    "\\end{bmatrix}^T \\in\\mathbb{R}^{N\\times K+1}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us start by computing the feature matrix $\\boldsymbol \\Phi$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## EDIT THIS FUNCTION\n",
    "def poly_features(X, K):\n",
    "    \n",
    "    # X: inputs of size N x 1\n",
    "    # K: degree of the polynomial\n",
    "    # computes the feature matrix Phi (N x (K+1))\n",
    "    \n",
    "    X = X.flatten()\n",
    "    N = X.shape[0]\n",
    "    \n",
    "    #initialize Phi\n",
    "    Phi = np.zeros((N, K+1))\n",
    "    \n",
    "    # Compute the feature matrix in stages\n",
    "    Phi = np.zeros((N, K+1)) ## <-- EDIT THIS LINE\n",
    "    return Phi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With this feature matrix we get the maximum likelihood estimator as\n",
    "$$\n",
    "\\boldsymbol \\theta^\\text{ML} = (\\boldsymbol\\Phi^T\\boldsymbol\\Phi)^{-1}\\boldsymbol\\Phi^T\\boldsymbol y\n",
    "$$\n",
    "For reasons of numerical stability, we often add a small diagonal \"jitter\" $\\kappa>0$ to $\\boldsymbol\\Phi^T\\boldsymbol\\Phi$ so that we can invert the matrix without significant problems so that the maximum likelihood estimate becomes\n",
    "$$\n",
    "\\boldsymbol \\theta^\\text{ML} = (\\boldsymbol\\Phi^T\\boldsymbol\\Phi + \\kappa\\boldsymbol I)^{-1}\\boldsymbol\\Phi^T\\boldsymbol y\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## EDIT THIS FUNCTION\n",
    "def nonlinear_features_maximum_likelihood(Phi, y):\n",
    "    # Phi: features matrix for training inputs. Size of N x D\n",
    "    # y: training targets. Size of N by 1\n",
    "    # returns: maximum likelihood estimator theta_ml. Size of D x 1\n",
    "    \n",
    "    kappa = 1e-08 # 'jitter' term; good for numerical stability\n",
    "    \n",
    "    D = Phi.shape[1]  \n",
    "    \n",
    "    # maximum likelihood estimate\n",
    "    theta_ml = np.zeros((D,1)) ## <-- EDIT THIS LINE\n",
    "  \n",
    "    \n",
    "    return theta_ml"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have all the ingredients together: The computation of the feature matrix and the computation of the maximum likelihood estimator for polynomial regression. Let's see how this works.\n",
    "\n",
    "To make predictions at test inputs $\\boldsymbol X_{\\text{test}}\\in\\mathbb{R}$, we need to compute the features (nonlinear transformations) $\\boldsymbol\\Phi_{\\text{test}}= \\boldsymbol\\phi(\\boldsymbol X_{\\text{test}})$ of $\\boldsymbol X_{\\text{test}}$ to give us the predicted mean\n",
    "$$\n",
    "\\mathbb{E}[\\boldsymbol y_{\\text{test}}] = \\boldsymbol \\Phi_{\\text{test}}\\boldsymbol\\theta^{\\text{ML}}\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "K = 5 # Define the degree of the polynomial we wish to fit\n",
    "Phi = poly_features(X, K) # N x (K+1) feature matrix\n",
    "\n",
    "theta_ml = nonlinear_features_maximum_likelihood(Phi, y) # maximum likelihood estimator\n",
    "\n",
    "# test inputs\n",
    "Xtest = np.linspace(-4,4,100).reshape(-1,1)\n",
    "\n",
    "# feature matrix for test inputs\n",
    "Phi_test = poly_features(Xtest, K)\n",
    "\n",
    "y_pred = Phi_test @ theta_ml # predicted y-values\n",
    "\n",
    "plt.figure()\n",
    "plt.plot(X, y, '+')\n",
    "plt.plot(Xtest, y_pred)\n",
    "plt.xlabel(\"$x$\")\n",
    "plt.ylabel(\"$y$\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Experiment with different polynomial degrees in the code above.\n",
    "#### Questions:\n",
    "1. What do you observe?\n",
    "2. What is a good fit?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluating the Quality of the Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us have a look at a more interesting data set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def f(x):   \n",
    "    return np.cos(x) + 0.2*np.random.normal(size=(x.shape))\n",
    "\n",
    "X = np.linspace(-4,4,20).reshape(-1,1)\n",
    "y = f(X)\n",
    "\n",
    "plt.figure()\n",
    "plt.plot(X, y, '+')\n",
    "plt.xlabel(\"$x$\")\n",
    "plt.ylabel(\"$y$\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let us use the work from above and fit polynomials to this dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## EDIT THIS CELL\n",
    "K = 2 # Define the degree of the polynomial we wish to fit\n",
    "\n",
    "Phi = poly_features(X, K) # N x (K+1) feature matrix\n",
    "\n",
    "theta_ml = nonlinear_features_maximum_likelihood(Phi, y) # maximum likelihood estimator\n",
    "\n",
    "# test inputs\n",
    "Xtest = np.linspace(-5,5,100).reshape(-1,1)\n",
    "ytest = f(Xtest) # ground-truth y-values\n",
    "\n",
    "# feature matrix for test inputs\n",
    "Phi_test = poly_features(Xtest, K)\n",
    "\n",
    "y_pred = Xtest*0 # <-- EDIT THIS LINE\n",
    "\n",
    "# plot\n",
    "plt.figure()\n",
    "plt.plot(X, y, '+')\n",
    "plt.plot(Xtest, y_pred)\n",
    "plt.plot(Xtest, ytest)\n",
    "plt.legend([\"data\", \"prediction\", \"ground truth observations\"])\n",
    "plt.xlabel(\"$x$\")\n",
    "plt.ylabel(\"$y$\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Questions:\n",
    "1. Try out different degrees of polynomials. \n",
    "2. Based on visual inspection, what looks like the best fit?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us now look at a more systematic way to assess the quality of the polynomial that we are trying to fit. For this, we compute the root-mean-squared-error (RMSE) between the $y$-values predicted by our polynomial and the ground-truth $y$-values. The RMSE is then defined as\n",
    "$$\n",
    "\\text{RMSE} = \\sqrt{\\frac{1}{N}\\sum_{n=1}^N(y_n - y_n^\\text{pred})^2}\n",
    "$$\n",
    "Write a function that computes the RMSE."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## EDIT THIS FUNCTION\n",
    "def RMSE(y, ypred):\n",
    "    rmse = -1 ## <-- EDIT THIS LINE\n",
    "    return rmse"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now compute the RMSE for different degrees of the polynomial we want to fit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## EDIT THIS CELL\n",
    "K_max = 20\n",
    "rmse_train = np.zeros((K_max+1,))\n",
    "\n",
    "for k in range(K_max+1):\n",
    "    \n",
    "     \n",
    "    rmse_train[k] = -1 # <-- EDIT THIS LINE\n",
    "        \n",
    "\n",
    "plt.figure()\n",
    "plt.plot(rmse_train)\n",
    "plt.xlabel(\"degree of polynomial\")\n",
    "plt.ylabel(\"RMSE\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question: \n",
    "1. What do you observe?\n",
    "2. What is the best polynomial fit according to this plot?\n",
    "3. Write some code that plots the function that uses the best polynomial degree (use the test set for this plot). What do you observe now?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# WRITE THE PLOTTING CODE HERE\n",
    "plt.figure()\n",
    "plt.plot(X, y, '+')\n",
    "ypred_test = Xtest*0 ## <--- EDIT THIS LINE (hint: you may require a few lines to do the computation)\n",
    "\n",
    "plt.plot(Xtest, ypred_test) \n",
    "plt.xlabel(\"$x$\")\n",
    "plt.ylabel(\"$y$\")\n",
    "plt.legend([\"data\", \"maximum likelihood fit\"]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The RMSE on the training data is somewhat misleading, because we are interested in the generalization performance of the model. Therefore, we are going to compute the RMSE on the test set and use this to choose a good polynomial degree."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## EDIT THIS CELL\n",
    "K_max = 20\n",
    "rmse_train = np.zeros((K_max+1,))\n",
    "rmse_test = np.zeros((K_max+1,))\n",
    "\n",
    "for k in range(K_max+1):\n",
    "    \n",
    "    # feature matrix\n",
    "    Phi = 0 ## <--- EDIT THIS LINE\n",
    "    \n",
    "    # maximum likelihood estimate\n",
    "    theta_ml = 0 ## <--- EDIT THIS LINE\n",
    "    \n",
    "    # predict y-values of training set\n",
    "    ypred_train = 0 ## <--- EDIT THIS LINE\n",
    "    \n",
    "    # RMSE on training set\n",
    "    rmse_train[k] = 0 ## <--- EDIT THIS LINE\n",
    "            \n",
    "    # feature matrix for test inputs\n",
    "    Phi_test = 0 ## <--- EDIT THIS LINE\n",
    "    \n",
    "    # prediction (test set)\n",
    "    ypred_test = 0 ## <--- EDIT THIS LINE\n",
    "    \n",
    "    # RMSE on test set\n",
    "    rmse_test[k] = -1 ## <--- EDIT THIS LINE\n",
    "    \n",
    "\n",
    "plt.figure()\n",
    "plt.semilogy(rmse_train) # this plots the RMSE on a logarithmic scale\n",
    "plt.semilogy(rmse_test) # this plots the RMSE on a logarithmic scale\n",
    "plt.xlabel(\"degree of polynomial\")\n",
    "plt.ylabel(\"RMSE\")\n",
    "plt.legend([\"training set\", \"test set\"]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Questions:\n",
    "1. What do you observe now?\n",
    "2. Why does the RMSE for the test set not always go down?\n",
    "3. Which polynomial degree would you choose now?\n",
    "4. Plot the fit for the \"best\" polynomial degree."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# WRITE THE PLOTTING CODE HERE\n",
    "plt.figure()\n",
    "plt.plot(X, y, '+')\n",
    "ypred_test = Xtest*0 ## <--- EDIT THIS LINE (hint: you may require a few lines to do the computation)\n",
    "\n",
    "plt.plot(Xtest, ypred_test) \n",
    "plt.xlabel(\"$x$\")\n",
    "plt.ylabel(\"$y$\")\n",
    "plt.legend([\"data\", \"maximum likelihood fit\"]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question\n",
    "If you did not have a designated test set, what could you do to estimate the generalization error (purely using the training set)?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Maximum A Posteriori Estimation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are still considering the model\n",
    "$$\n",
    "y = \\boldsymbol\\phi(\\boldsymbol x)^T\\boldsymbol\\theta + \\epsilon\\,,\\quad \\epsilon\\sim\\mathcal N(0,\\sigma^2)\\,.\n",
    "$$\n",
    "We assume that the noise variance $\\sigma^2$ is known."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Instead of maximizing the likelihood, we can look at the maximum of the posterior distribution on the parameters $\\boldsymbol\\theta$, which is given as\n",
    "$$\n",
    "p(\\boldsymbol\\theta|\\mathcal X, \\mathcal Y) = \\frac{\\overbrace{p(\\mathcal Y|\\mathcal X, \\boldsymbol\\theta)}^{\\text{likelihood}}\\overbrace{p(\\boldsymbol\\theta)}^{\\text{prior}}}{\\underbrace{p(\\mathcal Y|\\mathcal X)}_{\\text{evidence}}}\n",
    "$$\n",
    "The purpose of the parameter prior $p(\\boldsymbol\\theta)$ is to discourage the parameters to attain extreme values, a sign that the model overfits. The prior allows us to specify a \"reasonable\" range of parameter values. Typically, we choose a Gaussian prior $\\mathcal N(\\boldsymbol 0, \\alpha^2\\boldsymbol I)$, centered at $\\boldsymbol 0$ with variance $\\alpha^2$ along each parameter dimension."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The MAP estimate of the parameters is\n",
    "$$\n",
    "\\boldsymbol\\theta^{\\text{MAP}} = (\\boldsymbol\\Phi^T\\boldsymbol\\Phi + \\frac{\\sigma^2}{\\alpha^2}\\boldsymbol I)^{-1}\\boldsymbol\\Phi^T\\boldsymbol y\n",
    "$$\n",
    "where $\\sigma^2$ is the variance of the noise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## EDIT THIS FUNCTION\n",
    "def map_estimate_poly(Phi, y, sigma, alpha):\n",
    "    # Phi: training inputs, Size of N x D\n",
    "    # y: training targets, Size of D x 1\n",
    "    # sigma: standard deviation of the noise \n",
    "    # alpha: standard deviation of the prior on the parameters\n",
    "    # returns: MAP estimate theta_map, Size of D x 1\n",
    "    \n",
    "    D = Phi.shape[1] \n",
    "    \n",
    "    theta_map = np.zeros((D,1)) ## <-- EDIT THIS LINE\n",
    "    \n",
    "    return theta_map"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# define the function we wish to estimate later\n",
    "def g(x, sigma):\n",
    "    p = np.hstack([x**0, x**1, np.sin(x)])\n",
    "    w = np.array([-1.0, 0.1, 1.0]).reshape(-1,1)\n",
    "    return p @ w + sigma*np.random.normal(size=x.shape) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Generate some data\n",
    "sigma = 1.0 # noise standard deviation\n",
    "alpha = 1.0 # standard deviation of the parameter prior\n",
    "N = 20\n",
    "\n",
    "np.random.seed(42)\n",
    "\n",
    "X = (np.random.rand(N)*10.0 - 5.0).reshape(-1,1)\n",
    "y = g(X, sigma) # training targets\n",
    "\n",
    "plt.figure()\n",
    "plt.plot(X, y, '+')\n",
    "plt.xlabel(\"$x$\")\n",
    "plt.ylabel(\"$y$\");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# get the MAP estimate\n",
    "K = 8 # polynomial degree   \n",
    "\n",
    "\n",
    "# feature matrix\n",
    "Phi = poly_features(X, K)\n",
    "\n",
    "theta_map = map_estimate_poly(Phi, y, sigma, alpha)\n",
    "\n",
    "# maximum likelihood estimate\n",
    "theta_ml = nonlinear_features_maximum_likelihood(Phi, y)\n",
    "\n",
    "Xtest = np.linspace(-5,5,100).reshape(-1,1)\n",
    "ytest = g(Xtest, sigma)\n",
    "\n",
    "Phi_test = poly_features(Xtest, K)\n",
    "y_pred_map = Phi_test @ theta_map\n",
    "\n",
    "y_pred_mle = Phi_test @ theta_ml\n",
    "\n",
    "plt.figure()\n",
    "plt.plot(X, y, '+')\n",
    "plt.plot(Xtest, y_pred_map)\n",
    "plt.plot(Xtest, g(Xtest, 0))\n",
    "plt.plot(Xtest, y_pred_mle)\n",
    "\n",
    "plt.legend([\"data\", \"map prediction\", \"ground truth function\", \"maximum likelihood\"]);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print(np.hstack([theta_ml, theta_map]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let us compute the RMSE for different polynomial degrees and see whether the MAP estimate addresses the overfitting issue we encountered with the maximum likelihood estimate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## EDIT THIS CELL\n",
    "\n",
    "K_max = 12 # this is the maximum degree of polynomial we will consider\n",
    "assert(K_max < N) # this is the latest point when we'll run into numerical problems\n",
    "\n",
    "rmse_mle = np.zeros((K_max+1,))\n",
    "rmse_map = np.zeros((K_max+1,))\n",
    "\n",
    "for k in range(K_max+1):\n",
    "   \n",
    "    rmse_mle[k] = -1 ## Compute the maximum likelihood estimator, compute the test-set predicitons, compute the RMSE\n",
    "    rmse_map[k] = -1 ## Compute the MAP estimator, compute the test-set predicitons, compute the RMSE\n",
    "\n",
    "plt.figure()\n",
    "plt.semilogy(rmse_mle) # this plots the RMSE on a logarithmic scale\n",
    "plt.semilogy(rmse_map) # this plots the RMSE on a logarithmic scale\n",
    "plt.xlabel(\"degree of polynomial\")\n",
    "plt.ylabel(\"RMSE\")\n",
    "plt.legend([\"Maximum likelihood\", \"MAP\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Questions:\n",
    "1. What do you observe?\n",
    "2. What is the influence of the prior variance on the parameters ($\\alpha^2$)? Change the parameter and describe what happens."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Bayesian Linear Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Test inputs\n",
    "Ntest = 200\n",
    "Xtest = np.linspace(-5, 5, Ntest).reshape(-1,1) # test inputs\n",
    "\n",
    "prior_var = 2.0 # variance of the parameter prior (alpha^2). We assume this is known.\n",
    "noise_var = 1.0 # noise variance (sigma^2). We assume this is known.\n",
    "\n",
    "pol_deg = 3 # degree of the polynomial we consider at the moment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Assume a parameter prior $p(\\boldsymbol\\theta) = \\mathcal N (\\boldsymbol 0, \\alpha^2\\boldsymbol I)$. For every test input $\\boldsymbol x_*$ we obtain the \n",
    "prior mean\n",
    "$$\n",
    "E[f(\\boldsymbol x_*)] = 0\n",
    "$$\n",
    "and the prior (marginal) variance (ignoring the noise contribution)\n",
    "$$\n",
    "V[f(\\boldsymbol x_*)] = \\alpha^2\\boldsymbol\\phi(\\boldsymbol x_*) \\boldsymbol\\phi(\\boldsymbol x_*)^\\top\n",
    "$$\n",
    "where $\\boldsymbol\\phi(\\cdot)$ is the feature map."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## EDIT THIS CELL\n",
    "\n",
    "# compute the feature matrix for the test inputs\n",
    "Phi_test = np.zeros((Ntest, pol_deg+1))  # N x (pol_deg+1) feature matrix <--- EDIT THIS LINE\n",
    "raise NotImplementedError\n",
    "\n",
    "# compute the (marginal) prior at the test input locations\n",
    "# prior mean\n",
    "prior_mean = np.ones((Ntest,1))  # prior mean at test inputs (size: (Ntest,1)) <-- EDIT THIS LINE\n",
    "raise NotImplementedError\n",
    "\n",
    "# prior variance\n",
    "full_covariance = np.zeros((Ntest, Ntest)) # N x N covariance matrix of all function values <-- EDIT THIS LINE\n",
    "prior_marginal_var = 0 # marginal of size (N, )\n",
    "raise NotImplementedError\n",
    "\n",
    "# Let us visualize the prior over functions\n",
    "plt.figure()\n",
    "plt.plot(Xtest, prior_mean, color=\"k\")\n",
    "\n",
    "conf_bound1 = np.sqrt(prior_marginal_var).flatten()\n",
    "conf_bound2 = 2.0*np.sqrt(prior_marginal_var).flatten()\n",
    "conf_bound3 = 2.0*np.sqrt(prior_marginal_var + noise_var).flatten()\n",
    "plt.fill_between(Xtest.flatten(), prior_mean.flatten() + conf_bound1, \n",
    "             prior_mean.flatten() - conf_bound1, alpha = 0.1, color=\"k\")\n",
    "plt.fill_between(Xtest.flatten(), prior_mean.flatten() + conf_bound2, \n",
    "                 prior_mean.flatten() - conf_bound2, alpha = 0.1, color=\"k\")\n",
    "plt.fill_between(Xtest.flatten(), prior_mean.flatten() + conf_bound3, \n",
    "                 prior_mean.flatten() - conf_bound3, alpha = 0.1, color=\"k\")\n",
    "\n",
    "plt.xlabel('$x$')\n",
    "plt.ylabel('$y$')\n",
    "plt.title(\"Prior over functions\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we will use this prior distribution and sample functions from it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## EDIT THIS CELL\n",
    "\n",
    "# samples from the prior\n",
    "num_samples = 10\n",
    "\n",
    "# We first need to generate random weights theta_i, which we sample from the parameter prior\n",
    "random_weights = np.random.normal(size=(pol_deg+1,num_samples), scale=np.sqrt(prior_var))\n",
    "\n",
    "# Now, we compute the induced random functions, evaluated at the test input locations\n",
    "# Every function sample is given as f_i = Phi * theta_i, \n",
    "# where theta_i is a sample from the parameter prior\n",
    "\n",
    "sample_function = np.zeros((Ntest,)) # <-- EDIT THIS LINE\n",
    "raise NotImplementedError\n",
    "\n",
    "plt.figure()\n",
    "plt.plot(Xtest, sample_function, color=\"r\")\n",
    "plt.title(\"Plausible functions under the prior\")\n",
    "print(\"Every sampled function is a polynomial of degree \"+str(pol_deg));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are given some training inputs $\\boldsymbol x_1, \\dotsc, \\boldsymbol x_N$, which we collect in a matrix $\\boldsymbol X = [\\boldsymbol x_1, \\dotsc, \\boldsymbol x_N]^\\top\\in\\mathbb{R}^{N\\times D}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "N = 10\n",
    "X = np.random.uniform(high=5, low=-5, size=(N,1)) # training inputs, size Nx1\n",
    "y = g(X, np.sqrt(noise_var)) # training targets, size Nx1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let us compute the posterior "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## EDIT THIS FUNCTION\n",
    "\n",
    "def polyfit(X, y, K, prior_var, noise_var):\n",
    "    # X: training inputs, size N x D\n",
    "    # y: training targets, size N x 1\n",
    "    # K: degree of polynomial we consider\n",
    "    # prior_var: prior variance of the parameter distribution\n",
    "    # sigma: noise variance\n",
    "    \n",
    "    jitter = 1e-08 # increases numerical stability\n",
    "    \n",
    "    Phi = poly_features(X, K) # N x (K+1) feature matrix \n",
    "    \n",
    "    # Compute maximum likelihood estimate\n",
    "    theta_ml = np.zeros((K+1,1)) # <-- EDIT THIS LINE \n",
    "    \n",
    "    # MAP estimate\n",
    "    theta_map = np.zeros((K+1,1)) # <-- EDIT THIS LINE \n",
    "    \n",
    "    # Parameter posterior\n",
    "    SN = np.zeros(K+1) # covariance matrix of the parameter posterior # <-- EDIT THIS LINE \n",
    "    mN = np.zeros((K+1,1)) # mean vector of the parameter posterior   # <-- EDIT THIS LINE \n",
    "    \n",
    "    raise NotImplementedError\n",
    "    \n",
    "    return (theta_ml, theta_map, mN, SN)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "theta_ml, theta_map, theta_mean, theta_var = polyfit(X, y, pol_deg, alpha, sigma)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's make predictions (ignoring the measurement noise). We obtain three predictors:\n",
    "\\begin{align}\n",
    "&\\text{Maximum likelihood: }E[f(\\boldsymbol X_{\\text{test}})] = \\boldsymbol \\phi(X_{\\text{test}})\\boldsymbol \\theta_{ml}\\\\\n",
    "&\\text{Maximum a posteriori: } E[f(\\boldsymbol X_{\\text{test}})] = \\boldsymbol \\phi(X_{\\text{test}})\\boldsymbol \\theta_{map}\\\\\n",
    "&\\text{Bayesian: } p(f(\\boldsymbol X_{\\text{test}})) = \\mathcal N(f(\\boldsymbol X_{\\text{test}}) \\,|\\, \\boldsymbol \\phi(X_{\\text{test}}) \\boldsymbol\\theta_{\\text{mean}},\\, \\boldsymbol\\phi(X_{\\text{test}}) \\boldsymbol\\theta_{\\text{var}}  \\boldsymbol\\phi(X_{\\text{test}})^\\top)\n",
    "\\end{align}\n",
    "We already computed all quantities. Write some code that implements all three predictors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## EDIT THIS CELL\n",
    "\n",
    "# predictions (ignoring the measurement/observations noise)\n",
    "\n",
    "# maximum likelihood predictions (just the mean)\n",
    "m_mle_test = np.zeros((Ntest,1)) # <-- EDIT THIS LINE\n",
    "\n",
    "# MAP predictions (just the mean)\n",
    "m_map_test = np.zeros((Ntest,1)) # <-- EDIT THIS LINE\n",
    "\n",
    "# predictive distribution (Bayesian linear regression)\n",
    "# mean prediction\n",
    "mean_blr = np.zeros((Ntest,1)) # <-- EDIT THIS LINE\n",
    "# variance prediction\n",
    "cov_blr =  np.ones((Ntest,Ntest)) # <-- EDIT THIS LINE\n",
    "\n",
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# plot the posterior\n",
    "plt.figure()\n",
    "plt.plot(X, y, \"+\")\n",
    "plt.plot(Xtest, m_mle_test)\n",
    "plt.plot(Xtest, m_map_test)\n",
    "var_blr = np.diag(cov_blr)\n",
    "conf_bound1 = np.sqrt(var_blr).flatten()\n",
    "conf_bound2 = 2.0*np.sqrt(var_blr).flatten()\n",
    "conf_bound3 = 2.0*np.sqrt(var_blr + sigma).flatten()\n",
    "\n",
    "plt.fill_between(Xtest.flatten(), mean_blr.flatten() + conf_bound1, \n",
    "                 mean_blr.flatten() - conf_bound1, alpha = 0.1, color=\"k\")\n",
    "plt.fill_between(Xtest.flatten(), mean_blr.flatten() + conf_bound2, \n",
    "                 mean_blr.flatten() - conf_bound2, alpha = 0.1, color=\"k\")\n",
    "plt.fill_between(Xtest.flatten(), mean_blr.flatten() + conf_bound3, \n",
    "                 mean_blr.flatten() - conf_bound3, alpha = 0.1, color=\"k\")\n",
    "plt.legend([\"Training data\", \"MLE\", \"MAP\", \"BLR\"])\n",
    "plt.xlabel('$x$');\n",
    "plt.ylabel('$y$');"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
